Random Forests

#decision #ensemble #forest #learning #machine #ml #notes #python #random #sklearn #tensorflow

A Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate predictor. It leverages the idea of "wisdom of the crowd" where averaging predictions from diverse trees reduces variance and improves generalization.

  • Uses bagging (bootstrap aggregating) to train individual decision trees on random subsets of data with random feature selection at each split.
  • Each tree votes on the final prediction, with the majority class being the ensemble's prediction (for classification) or the average prediction (for regression).
  • Enhances model stability and reduces overfitting compared to a single decision tree.
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model on your data
clf.fit(X_train, y_train)

# Make predictions on new data
predictions = clf.predict(X_test)
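
If a held-out label vector is available, the predictions can be scored in the usual scikit-learn way. The y_test variable below is an assumption; it is not defined in the snippet above.

from sklearn.metrics import accuracy_score

# Fraction of correct predictions on the held-out data (y_test assumed to exist alongside X_test)
print(accuracy_score(y_test, predictions))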

Bagging

A technique for reducing variance in machine learning models by training multiple models on different bootstrap samples of the data, drawn with replacement (so a given sample may appear more than once in a subset, or in several subsets). Averaging the predictions from these models makes the overall prediction less sensitive to fluctuations in the training data.

  • Creates diversity among the base models by using different data subsets.
  • Particularly effective for reducing variance in unstable models like decision trees.
  • Used in Random Forests and other ensemble methods.
# A Random Forest already applies bagging internally, so the Random Forest code above covers this case too.
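
For a standalone illustration, here is a minimal bagging sketch using scikit-learn's BaggingClassifier, reusing the X_train, y_train, and X_test variables assumed in the Random Forest example above.

from sklearn.ensemble import BaggingClassifier

# 100 bagged trees; the default base estimator is a decision tree,
# and bootstrap=True (also the default) draws each training subset with replacement
bag = BaggingClassifier(n_estimators=100, bootstrap=True)

# Train the ensemble
bag.fit(X_train, y_train)

# The prediction aggregates the votes of the individual trees
predictions = bag.predict(X_test)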

Boosting

A sequential ensemble method where each new model in the sequence learns from the errors of the previous model. The goal is to progressively improve the overall prediction accuracy by focusing on the data points that the earlier models misclassified.

  • Models are built sequentially, with each model fitting the residuals (errors) of the previous model.
  • Weights are assigned to data points, increasing weights for misclassified points to give them more focus in subsequent models.
  • Can be more prone to overfitting than bagging if not carefully tuned.
  • Examples of boosting algorithms include AdaBoost and Gradient Boosting.
from sklearn.ensemble import GradientBoostingClassifier

# Create a Gradient Boosting classifier with 100 trees and a learning rate of 0.1
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

# Train the model on your data
gbc.fit(X_train, y_train)

# Make predictions on new data
predictions = gbc.predict(X_test)
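
AdaBoost is listed above as another boosting algorithm but is not shown; here is a minimal sketch using scikit-learn's AdaBoostClassifier, again reusing the X_train, y_train, and X_test variables assumed earlier.

from sklearn.ensemble import AdaBoostClassifier

# AdaBoost reweights the training points after each round so that
# later weak learners concentrate on the samples misclassified so far
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.1)

# Train the model on your data
ada.fit(X_train, y_train)

# Make predictions on new data
predictions = ada.predict(X_test)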

Gradient Boosting

A specific type of boosting algorithm that uses a gradient descent-like approach to minimize a loss function. It iteratively builds decision trees, each one focused on correcting the errors of the trees before it.

  • Utilizes decision trees as weak learners (simple models).
  • Employs a loss function to measure prediction errors.
  • Each subsequent tree attempts to correct the errors made by the previous one.
  • Popular algorithm for regression and classification tasks.

The Python code in the Boosting section above demonstrates Gradient Boosting using GradientBoostingClassifier, a powerful and versatile ensemble method. A small hand-rolled sketch of the residual-fitting idea follows.
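
To make the error-correcting mechanism concrete, the sketch below runs two rounds of gradient boosting by hand for the regression case with squared-error loss. It assumes numeric targets in y_train (rather than the class labels used above) and a learning rate of 0.1; these are assumptions for illustration, not part of the original example.

from sklearn.tree import DecisionTreeRegressor

lr = 0.1  # learning rate: how much each tree contributes to the ensemble

# Round 1: a shallow tree fits the targets directly
tree1 = DecisionTreeRegressor(max_depth=2)
tree1.fit(X_train, y_train)

# Round 2: the next tree fits the residuals (errors) left by the first tree
residuals = y_train - lr * tree1.predict(X_train)
tree2 = DecisionTreeRegressor(max_depth=2)
tree2.fit(X_train, residuals)

# The ensemble prediction is the sum of the scaled tree predictions
y_pred = lr * tree1.predict(X_test) + lr * tree2.predict(X_test)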

Choosing Between Bagging and Boosting:

  • If reducing variance (model instability) is your primary concern, bagging is a good choice.
  • If you need a more powerful model that can potentially achieve lower bias (underfitting), boosting might be preferable. However, be cautious of overfitting and tune the learning rate carefully.